The PCA shows that there might be two groups and, as seen before, they coincide with the wine type. The question is: are there only two groups, or are there more? Can the points in the dataset be classified in another way? Several methods will be compared to find the best classification.

Partitional clustering

Let’s start with partitional clustering techniques, which start from an initial cluster definition and proceed by exchanging elements between clusters until an appropriate cluster structure is found. However, these methods require the number of clusters \(k\) in advance, so how many clusters are optimal? Let’s compare several criteria:

One can expect the total within sum of squares to stabilize beyond the optimal number of clusters, since adding clusters past the optimal number does not explain much more variance. However, there is no clear stabilization point in this plot, so other methods must be used, such as the average silhouette or the gap statistic.

The idea is that the silhouette of a point ranges from -1 to 1 and indicates whether it is wrongly or correctly assigned to its own cluster. From the average silhouette plot one can assess the goodness of the classification: the higher the silhouette, the better. We can see that the optimal number of clusters is 4, while the gap statistic (following plot) suggests an optimal number of 6. However, we need to take into account that the gap statistic method looks for the maximum value of the statistic or its first local maximum, which usually leads to very large values of \(k\). We can see that 6 is just on the limit of being a local maximum, since its difference with respect to 7 is not very big. Therefore, to avoid this uncertainty, the chosen value will be \(k=4\), from the average silhouette analysis.
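The analysis in this report was done in R, but the silhouette computation itself is simple. As an illustration only, here is a minimal pure-Python sketch on a hypothetical two-cluster toy dataset, using the standard definition \(s(i) = (b-a)/\max(a,b)\), where \(a\) is the mean distance of a point to its own cluster and \(b\) the mean distance to the nearest other cluster:

```python
from math import dist  # Euclidean distance between coordinate tuples (Python 3.8+)

def silhouette(points, labels):
    """Average silhouette: s(i) = (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b the mean distance to the
    nearest other cluster."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:            # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / sum(1 for j in range(len(points)) if labels[j] == c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated hypothetical clusters: the average silhouette is high
# with the correct labels and drops sharply with mixed labels.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(silhouette(pts, [0, 0, 0, 1, 1, 1]))
print(silhouette(pts, [0, 1, 0, 1, 0, 1]))
```

This is only a didactic sketch of the criterion used throughout this section, not the implementation behind the plots.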

Now that the optimal value is known, it’s time to perform the clustering. The first algorithm used is k-means, which minimizes the within sum of squares for a given \(k\). The first two PCs with the k-means clustering are shown in the following figure:

It can be seen that there is great overlap between the groups. Analyzing the silhouette, one gets an average value of 0.26, which is positive, indicating a reasonable assignment, though not an excellent one.
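As a sketch of what k-means does under the hood (the alternation between assignment and centroid update known as Lloyd’s algorithm), here is a minimal pure-Python version on a hypothetical toy dataset; it is not the implementation used in this report:

```python
import random
from math import dist

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assigning each point to its
    nearest centroid and recomputing centroids, until assignments stop
    changing. This monotonically decreases the within sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels, centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, cents = kmeans(pts, 2)
```

On this toy data the two tight groups are recovered for any initialization.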

The next algorithm used is PAM, which replaces the mean vector used in k-means, sensitive to outliers, with the medoid (the observation whose average distance to all other observations in the cluster is minimal). This algorithm is also known as k-medoids. The clusters are shown in the following plot:

The solution is different from the one obtained with k-means and there is still considerable overlap between the groups. The average silhouette obtained is 0.2, worse than with k-means.
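To illustrate why the medoid is robust where the mean is not, consider this minimal pure-Python sketch on a hypothetical cluster containing one outlier:

```python
from math import dist

def medoid(cluster):
    """The medoid is the observation minimizing the total distance to all
    other observations in the cluster. Unlike the mean, it is always an
    actual data point, which makes it robust to outliers."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

# The outlier (100, 100) drags the mean far away from the bulk of the
# data, but the medoid stays inside the tight group.
cluster = [(0, 0), (0, 1), (1, 0), (100, 100)]
mean = tuple(sum(x) / len(cluster) for x in zip(*cluster))
print(medoid(cluster), mean)
```

This is only the medoid-selection step; full PAM additionally swaps medoids and non-medoids to reduce the total cost.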

The next algorithm used is CLARA (Clustering LARge Applications), a version of k-medoids designed for large datasets. The groups obtained are shown in the following figure:

The solution is again different from the ones previously obtained, with an average silhouette of 0.19, the worst one so far.

Hierarchical clustering

The idea of hierarchical clustering is to start either with a single cluster and split it, or with \(k=n\) singleton clusters (one per observation) and merge them (agglomerative algorithms), so that \(k\) does not need to be known in advance. To choose \(k\), one looks at distances, in our case the Manhattan distance, in what is called a dendrogram. However, due to the large number of observations in the dataset, judging accurately from the dendrogram is difficult, so we will use \(k=4\) as obtained in the previous analysis.
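The agglomerative merge loop can be sketched naively in pure Python, using the Manhattan distance as in the report (toy data; real implementations use far more efficient distance updates). Passing `min` or `max` as the linkage switches between single and complete linkage:

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def agglomerate(points, k, linkage=min):
    """Naive agglomerative clustering: start from singletons and repeatedly
    merge the two closest clusters until k remain. linkage=min gives
    single linkage, linkage=max complete linkage."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(manhattan(p, q)
                            for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = agglomerate(pts, 2, linkage=max)   # complete linkage
```

The sequence of merge distances is exactly what a dendrogram displays.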

We will start with agglomerative algorithms and the first method used is single linkage, with the solution shown in the following figure:

We can see that we get an average silhouette of 0.27, which is good compared with the previous ones, but the distribution of clusters is awful: one very big group and three singletons, as can be seen in the PC plot and in the dendrogram, where the size of each group is represented by a blue box (only one very big blue box, with the three singletons reduced to vertical lines that cannot be seen). As the dendrogram shows, single linkage tends to split off individual points first, so increasing \(k\) would probably only add more singletons.

A more sophisticated way of dealing with distances is complete linkage:

It is possible to see that there are now 4 groups with no singletons, even if the dendrogram shows that two of the groups are very small. The solution has an average silhouette of 0.29, which is close to the partitional clustering solutions.

Another approach is using average linkage which yields the following solution:

Again we find a solution with singletons and one very big group, so even though it has a high average silhouette (0.32), it is a bad solution.

Finally, the last agglomerative algorithm is Ward linkage:

This solution is probably the best one among the agglomerative hierarchical clustering methods, with 4 more or less balanced groups, no singletons and an average silhouette of 0.21.

Now we will try a divisive algorithm: we start from a single cluster containing all the observations and split it repeatedly (again using the Manhattan distance). The whole splitting process (analogous to the merging process of the previous algorithms) can be seen in the dendrogram, and the solution retained was that of \(k=4\).

This procedure yields a realistic solution (with no singletons) with an average silhouette of 0.23.

Model based clustering

Finally, the last clustering approach is model based clustering. This approach is not based on distances but on probabilities. It assumes that the observations are generated by a mixture of distributions, each with a certain probability, so that the population an observation belongs to can be identified using Bayes’ theorem.
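To make the Bayes step concrete, here is a minimal pure-Python sketch with a hypothetical one-dimensional two-component Gaussian mixture (the model fitted below is multivariate, so this is illustration only): the posterior probability of each component is its prior weight times its density, normalized over all components.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def posteriors(x, weights, mus, sigmas):
    """Bayes' theorem: P(cluster k | x) is proportional to
    pi_k * f_k(x), normalized over all mixture components."""
    likes = [w * normal_pdf(x, m, s)
             for w, m, s in zip(weights, mus, sigmas)]
    total = sum(likes)
    return [l / total for l in likes]

# A point near the first component's mean gets a posterior close to 1;
# a point halfway between the two means is maximally uncertain (0.5/0.5).
p = posteriors(0.1, weights=[0.5, 0.5], mus=[0.0, 5.0], sigmas=[1.0, 1.0])
print(p)
```

The halfway case is precisely the kind of classification uncertainty discussed later for the fitted model.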

There are several models to choose from (spherical, diagonal and ellipsoidal), which are restrictions on the covariance matrix, combined with equal or unequal volume (largest eigenvalue for each \(k\) identical or not), equal or unequal shape (matrix of eigenvalues for each \(k\) identical or not) and equal or unequal orientation (matrix of eigenvectors for each \(k\) identical or not). In total, there are 14 configurations. The Bayesian Information Criterion (BIC) is analyzed for all of them in order to find the best one and the optimal \(k\). The lower the BIC, the better. The following plot shows \(-\mathrm{BIC}\) for all models, so the model on top is the best:

The best model from Mclust is VVE (ellipsoidal, equal orientation) model with \(k=4\) components and the solution it yields is the following:

With the following densities:

The results obtained are realistic, since they contain no singletons, but one should carefully analyze the uncertainty of the classification or, in other words, remember that we are working with probabilities. Ideally, we would like to classify an observation with probability 1 of belonging to a certain cluster. In reality, however, an observation may have significant probabilities for several clusters, as can be seen in the following plot:

It can be seen that while most of the points of cluster 1 (blue) belong to it without much doubt, because they have low probability of being in other clusters, this does not happen for the other three clusters, where some observations have probabilities of up to almost 50% of being in another group. Keep in mind that 50% is the limit, since each observation is assigned to the cluster with the maximum posterior probability, so an observation classified into group 4 can still have as much as 50% probability of being in another group.

In general, the uncertainty of the points is shown in the following plot, where we see overlapping.

Finally, let’s compare its silhouette to that of the other models:

The average silhouette is 0.12, so other models performed better. In particular, the best model was complete linkage, with an average silhouette of 0.29. Notably, in that model there were two very small groups while the other two were almost identical to the PC distinction between red and white wines. We should therefore expect some hidden variable that divides the white wines into three groups while leaving the red ones as a single group. This will be analysed in the factor analysis.

Factor analysis in the clusters obtained

As said, the best clustering was achieved with complete linkage clustering. Now, factor analysis will be performed on each individual cluster to see their different traits.

Principal Component Factor Analysis

First cluster (blue)

Let’s first analyze the explained percentage of variance of each eigenvalue:

The plot suggests four factors. We can apply the varimax rotation on them for interpretability and finally get the following results:

The first factor appears to be an index of the acidity (both fixed acidity and pH) and the alcohol percentage of the wine: its alcoholic and acid strength (but not its taste).

The second factor simply applies to sulfur dioxide present in the wine.

The third factor is a measure of the sulphates and citric acid of the wine, as well as its quality: a general measure of the strength of its flavour.

Finally, the fourth factor is a measure of the volatile acidity and residual sugar, i.e., of the sweetness of the wine.

In order to analyze the goodness of the factor model, let’s see its communalities and uniquenesses.

It can be seen that the factor model better explains total sulfur dioxide, fixed acidity and density, while the worst explained variables are volatile acidity, residual sugar and citric acid.
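As a reminder of how these quantities are obtained: for standardized variables, the communality of a variable is the sum of its squared loadings across factors, and the uniqueness is its complement. A minimal sketch with a hypothetical (made-up) rotated loading matrix:

```python
# Hypothetical rotated loading matrix for illustration only
# (rows: variables, columns: factors). These are NOT the report's values.
loadings = {
    "fixed acidity":  [0.85, 0.10],
    "alcohol":        [0.80, 0.05],
    "residual sugar": [0.10, 0.40],
}

def communality(row):
    """Sum of squared loadings; uniqueness = 1 - communality for
    standardized variables."""
    return sum(l * l for l in row)

for var, row in loadings.items():
    h2 = communality(row)
    print(f"{var:15s} communality={h2:.3f} uniqueness={1 - h2:.3f}")
```

A high communality means the shared factors explain most of that variable’s variance, which is exactly the comparison made in the text.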

The four factors obtained are uncorrelated and account for alcohol strength, sulfur dioxide, sweetness and taste strength of the wine. The representation of all the points in these 4 factors can be seen with factor scores:

Finally, one can see in the residuals that there are only very minor correlations left unexplained by the factor model, the only notable one being that between alcohol and fixed acidity.

Second cluster (green)

Let’s first analyze the explained percentage of variance of each eigenvalue:

It suggests two factors for the model. Estimating the \(M\) matrix and applying the varimax rotation, one gets the following variable importances for the two factors:

The first factor is greatly influenced by density, chlorides and fixed acidity. Chlorides are related to the saltiness of the wine, so in general this factor tells us the amount of other substances present in the wine (hence the importance of density), with saltiness and acidity being the main ones affected.

The second factor, on the other hand, is affected by alcohol and quality. Since a higher alcohol percentage is associated with older wines, which are usually also associated with higher quality, this factor is an index of the aging of the wine.

In order to analyze the goodness of the factor model, let’s see its communalities and uniquenesses.

It can be seen that the factor model better explains density, chlorides and alcohol while the worst explained variables are citric acid, free sulfur dioxide and residual sugar.

The two factors obtained are uncorrelated and account for the saltiness/acidity of the wine and its aging. The representation of all the points in these 2 factors can be seen with factor scores:

Finally, one can see in the residuals that the factor model leaves only very minor correlations unexplained, accounting for most of the others.

Third cluster (orange)

Let’s first analyze the explained percentage of variance of each eigenvalue:

It suggests five factors for the model. Estimating the M matrix and after varimax rotation one gets the following variable importance for the five factors:

The first factor is clearly an index of the acidity.

The second factor is influenced by chlorides, sulphates and density, so it is an index of the saltiness of the wine.

The third factor is influenced by alcohol and quality, so again it looks like an index of the aging of the wine.

The fourth factor is simply an index of the sulfur dioxide.

Finally, the last factor is an index of the sweetness of the wine.

In order to analyze the goodness of the factor model, let’s see its communalities and uniquenesses.

The variables best explained by the model are total and free sulfur dioxide as well as density, while the worst explained are pH, quality and chlorides. In general, though, the model’s communalities are high.

The five factors obtained are uncorrelated and account for acidity, saltiness, sulfur dioxide, aging and sweetness. The representation of all the points in these 5 factors can be seen with factor scores:

Finally, one can see in the residuals that there are only very minor correlations left unexplained by the factor model, the most notable ones involving free and total sulfur dioxide as well as residual sugar and fixed acidity.

Fourth cluster (violet)

Let’s first analyze the explained percentage of variance of each eigenvalue:

It might suggest only one factor, but two will be taken, because a single factor yields very poor results. Estimating the \(M\) matrix and applying the varimax rotation, one gets the following variable importances for the factors:

The first factor is influenced by alcohol and quality (among others), so it is a measure of the aging of the wine.

The second factor is a measure of the acidity.

In order to analyze the goodness of the factor model, let’s see its communalities and uniquenesses.

The variables best explained by the model are density, alcohol and fixed acidity, while the worst explained are citric acid, sulphates and volatile acidity.

The two factors obtained are uncorrelated and account for the aging and acidity of the wine. The representation of all the points in these 2 factors can be seen with factor scores:

Finally, one can see in the residuals that the factor model leaves only very minor correlations unexplained, accounting for most of the others.

It is also surprising that this last factor model, with only two factors, explained almost everything, as can be seen in its residual plots, while factor models with more factors for the other clusters got worse results.

In conclusion, we got that the main traits for each cluster are:

  • Cluster 1: Alcohol strength, sulfur dioxide, sweetness and taste strength
  • Cluster 2: Saltiness/acidity and aging
  • Cluster 3: Acidity, saltiness, sulfur dioxide, aging and sweetness
  • Cluster 4: Aging and acidity

As can be seen, they have much in common, with cluster 1 being the most different one, having two unique factors: alcohol and taste strength. Now, let’s compare these results with Principal Factor Analysis.

Principal Factor Analysis on the clusters

Now we will analyze the factor model using principal factor analysis, with the same number of factors as obtained previously for each cluster.

First cluster

Let’s compare the differences between PCFA and PFA:

It can be seen that for cluster 1 all the factors are different except the second one, which seems more or less linearly related to its PCFA counterpart.

On the other hand, in terms of noise, both approaches are very similar:

The main difference between PCFA and PFA is that the communalities are smaller in PFA, which means that the PFA factor model explains the variables somewhat worse. However, the change is not very big and the main ideas are kept. One can also obtain the factor scores in a similar way as before and, looking at the correlations, it can be seen that the main ideas are still there, with only the factor order changed.

The same applies to the residuals, where we can see that the PFA model is a bit worse, with more unexplained variance.

It can be seen that the PFA model presents more correlation between the residuals, so it is slightly worse, as noted when looking at the communalities. Nevertheless, it is similar to the PCFA results, as can be seen in the following plot:

Second cluster

Let’s compare the differences between PCFA and PFA:

It can be seen that for cluster 2 all the factors are linearly related to their corresponding PCFA versions, so they keep their physical meaning.

On the other hand, in terms of noise, both approaches are very similar:

The main difference between PCFA and PFA is that the communalities are smaller in PFA, which means that the PFA factor model explains the variables somewhat worse. However, the change is not very big and the main ideas are kept. One can also obtain the factor scores in a similar way as before and, looking at the correlations, it can be seen that the main ideas are still there.

The same applies to the residuals, where we can see that the PFA model is a bit worse, with more unexplained variance.

It can be seen that the PFA model presents more correlation between the residuals, so it is slightly worse, as noted when looking at the communalities. Nevertheless, it is similar to the PCFA results, as can be seen in the following plot:

Third cluster

Let’s compare the differences between PCFA and PFA:

It can be seen that for cluster 3 all the factors except the second and third are linearly related to their PCFA counterparts, so they keep their meaning.

On the other hand, in terms of noise, both approaches are very similar:

The main difference between PCFA and PFA is that the communalities are smaller in PFA, which means that the PFA factor model explains the variables somewhat worse. However, the change is not very big and the main ideas are kept. One can also obtain the factor scores in a similar way as before and, looking at the correlations, it can be seen that the main ideas are still there, with only the order of the second and third factors swapped.

The same applies to the residuals, where we can see that the PFA model is a bit worse, with more unexplained variance.

It can be seen that the PFA model presents more correlation between the residuals, so it is slightly worse, as noted when looking at the communalities. Nevertheless, it is similar to the PCFA results, as can be seen in the following plot:

Fourth cluster

Let’s compare the differences between PCFA and PFA:

It can be seen that for cluster 4 the two factors have a linear relationship between PFA and PCFA, so they share the same meaning.

On the other hand, in terms of noise, both approaches are very similar:

The main difference between PCFA and PFA is that the communalities are smaller in PFA, which means that the PFA factor model explains the variables somewhat worse. However, the change is not very big and the main ideas are kept. One can also obtain the factor scores in a similar way as before and, looking at the correlations, it can be seen that the main ideas are still there.

Let’s now look at the residuals of the PFA model, where we can see that it is a bit worse, with more unexplained variance:

It can be seen that the PFA model presents more correlation between the residuals, so it is slightly worse, as noted when looking at the communalities. Nevertheless, it is similar to the PCFA results, as can be seen in the following plot:

Maximum Likelihood Estimation

Finally, there is another method, Maximum Likelihood Estimation, which makes use of the distribution of the data and relies heavily on Gaussianity. It was already discussed in the previous part of the project that the data, even with the logarithm transformation, are not Gaussian, so they do not fulfill the requirements of this technique. An attempt was made with no real improvement over PCFA and PFA, so it is only mentioned here for completeness.

Discussion and conclusions

In conclusion, several clustering configurations have been studied, the optimal one being complete linkage with 4 clusters. Then, factor analysis has been applied to each of the 4 clusters to see if they had different traits. This analysis was approached with two different methods, Principal Component Factor Analysis and Principal Factor Analysis, yielding a compatible joint solution:

If we remember that high values of the first PC were associated with red wines and low values with white wines, one can see that the red wine group has its own cluster (violet), while the white wines show more diversity, with three different categories. The blue section would correspond to strong white wines, while the main difference between clusters 2 and 3 would be their age. Since the aging weight for cluster 2 is positive, cluster 2 corresponds to old wines and cluster 3 to young wines. It can also be seen that there is far more variability in cluster 3 than in cluster 2, with five factors needed instead of two. This indicates that there is more variability in the young white wine market.

Separating the groups could be useful, for example, if we wanted to sell a new wine. We would need more information about competing wines than just whether they are “red” or “white”. To find our direct competitors, we could use this clustering and the factor analysis information to determine the group our wine belongs to depending on its characteristics. Furthermore, it can be used to identify potential niches or opportunities where there are fewer competitors in the market (i.e., a cluster with few observations in it).

Extra: Task Topics 5 and 6

Task topic 5: Multidimensional Scaling

Choose at least eight entities, such as clothing brands, politicians, drinks, etc. With them, define with your own opinions a matrix of dissimilarities with which to perform multidimensional scaling. Then, perform the analysis and obtain conclusions from the results obtained.

The idea is to pick 8 entities and relate them intuitively, then find the hidden variables that explain those similarities. I have chosen 8 drinks: wine, juice, water, beer, vodka, gin, tea and coffee. To build the similarity matrix, I considered their water content and alcohol percentage. Since most of them, except vodka and gin, are mostly water, they have high similarities with water and among themselves. However, juice is similar to wine because it is sweet, but not to beer, because beer is not sweet and also has other compounds. Beer has a stronger taste, more similar to tea or coffee. Finally, vodka and gin are very alcoholic and very different from the rest, but close to each other. With this in mind, the final matrix is the following:

##        wine juice water beer vodka gin  tea coffee
## wine   1.00  0.85  0.70  0.4   0.1 0.1 0.60   0.60
## juice  0.85  1.00  0.95  0.2   0.1 0.1 0.90   0.90
## water  0.70  0.95  1.00  0.7   0.3 0.3 0.95   0.95
## beer   0.40  0.20  0.70  1.0   0.1 0.1 0.70   0.70
## vodka  0.10  0.10  0.30  0.1   1.0 0.9 0.10   0.10
## gin    0.10  0.10  0.30  0.1   0.9 1.0 0.10   0.10
## tea    0.60  0.90  0.95  0.7   0.1 0.1 1.00   0.90
## coffee 0.60  0.90  0.95  0.7   0.1 0.1 0.90   1.00

Let’s check how many eigenvalues we need:

There are six positive eigenvalues and, as can be seen in the precision measure, only two of them are needed. Finally, let’s see how the different drinks are classified:

We can see that the first principal coordinate separates spirits (very alcoholic) from non-spirits (low or no alcohol), while the second principal coordinate differentiates types of flavour: sweet, with juice and wine (and a little of gin and vodka), versus bitter, with beer (and, to a lesser extent, coffee and tea). Since in my classification of similarities I placed coffee and tea very close to water, it makes sense that they appear close now: what this really shows are the underlying variables that made up the dissimilarity matrix. Since I created the matrix, they indeed reproduce my thoughts when building it, which means that this procedure would be able to find the hidden variables in a real case scenario.
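The principal-coordinate construction can be sketched in pure Python directly from the similarity matrix above. Two assumptions here are not stated in the report and are mine: the conversion \(d_{ij} = \sqrt{2(1 - s_{ij})}\) from similarities to dissimilarities, and power iteration (with a Gershgorin shift) to extract the leading coordinate instead of a full eigendecomposition:

```python
from math import sqrt

drinks = ["wine", "juice", "water", "beer", "vodka", "gin", "tea", "coffee"]
S = [
    [1.00, 0.85, 0.70, 0.40, 0.10, 0.10, 0.60, 0.60],
    [0.85, 1.00, 0.95, 0.20, 0.10, 0.10, 0.90, 0.90],
    [0.70, 0.95, 1.00, 0.70, 0.30, 0.30, 0.95, 0.95],
    [0.40, 0.20, 0.70, 1.00, 0.10, 0.10, 0.70, 0.70],
    [0.10, 0.10, 0.30, 0.10, 1.00, 0.90, 0.10, 0.10],
    [0.10, 0.10, 0.30, 0.10, 0.90, 1.00, 0.10, 0.10],
    [0.60, 0.90, 0.95, 0.70, 0.10, 0.10, 1.00, 0.90],
    [0.60, 0.90, 0.95, 0.70, 0.10, 0.10, 0.90, 1.00],
]
n = len(S)
# Assumed conversion: d_ij = sqrt(2 (1 - s_ij)), so d^2_ij = 2 (1 - s_ij)
D2 = [[2 * (1 - S[i][j]) for j in range(n)] for i in range(n)]
# Classical MDS double centering: B = -1/2 J D2 J, J = I - (1/n) 11'
rm = [sum(row) / n for row in D2]
gm = sum(rm) / n
B = [[-0.5 * (D2[i][j] - rm[i] - rm[j] + gm) for j in range(n)]
     for i in range(n)]
# Leading eigenvector of B via power iteration; the Gershgorin shift t
# makes B + tI positive semidefinite so the algebraically largest
# eigenvalue of B dominates.
t = max(sum(abs(x) for x in row) for row in B)
M = [[B[i][j] + (t if i == j else 0.0) for j in range(n)] for i in range(n)]
v = [1.0] + [0.0] * (n - 1)
for _ in range(500):
    w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    norm = sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
coord1 = [sqrt(max(lam, 0.0)) * x for x in v]
print({d: round(c, 2) for d, c in zip(drinks, coord1)})
```

Up to a sign, `coord1` is the first principal coordinate: vodka and gin land together, far from the water-like drinks, matching the spirits/non-spirits axis described above.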

Task topic 6: Correspondence Analysis

Health is a contingency table corresponding to 6371 patients of a hospital with two variables. The first one is the age segment and the second one is the health status where VG, G, R, B, and VB mean “Very good”, “Good”, “Fair”, “Bad” and “Very bad”. It is requested to perform a correspondence analysis with this table and draw conclusions from the analysis.

Let’s start by looking at the health distribution in terms of age groups:

## [1] "Contingency table"
##         VG    G    R    B   VB  Sum
## 16-24  243  789  167   18    6 1223
## 25-34  220  809  164   35    6 1234
## 35-44  147  658  181   41    8 1035
## 45-54   90  469  236   50   16  861
## 55-64   53  414  306  106   30  909
## 65-74   44  267  284   98   20  713
## 75+     20  136  157   66   17  396
## Sum    817 3542 1495  414  103 6371

Both plots suggest that the two variables are related in some way. Note, however, that some classes are more represented than others, e.g., there are more young people than old people, and, in general, younger people tend to have better health. Let’s perform a chi-squared test to see whether this dependence of health on age is significant:

## 
##  Pearson's Chi-squared test
## 
## data:  N
## X-squared = 894.86, df = 24, p-value < 2.2e-16
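The Pearson statistic above can be reproduced directly from the contingency table. A minimal pure-Python sketch (the report itself uses R’s test, so this is only a cross-check of the formula \(\chi^2 = \sum (O - E)^2 / E\) with \(E = \text{row total} \times \text{column total} / n\)):

```python
# Contingency table from the report (columns: VG, G, R, B, VB)
N = {
    "16-24": [243, 789, 167, 18, 6],
    "25-34": [220, 809, 164, 35, 6],
    "35-44": [147, 658, 181, 41, 8],
    "45-54": [90, 469, 236, 50, 16],
    "55-64": [53, 414, 306, 106, 30],
    "65-74": [44, 267, 284, 98, 20],
    "75+":   [20, 136, 157, 66, 17],
}
rows = list(N.values())
row_tot = [sum(r) for r in rows]
col_tot = [sum(c) for c in zip(*rows)]
total = sum(row_tot)
# Pearson statistic: sum over cells of (observed - expected)^2 / expected
chi2 = sum((rows[i][j] - row_tot[i] * col_tot[j] / total) ** 2
           / (row_tot[i] * col_tot[j] / total)
           for i in range(len(rows)) for j in range(len(col_tot)))
df = (len(rows) - 1) * (len(col_tot) - 1)
print(f"X-squared = {chi2:.2f}, df = {df}")
```

The degrees of freedom are \((7-1)(5-1) = 24\), matching the R output.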

The p-value is very small (below 2.2e-16), meaning that there is a significant dependence between age and health level. Now, we would like to understand the association between these two qualitative variables: age group and health status. To do that, we perform correspondence analysis with the library “ca”, getting the following results:

The interpretation is as follows:

  • If row points (age groups) are close, then these rows have similar conditional distributions across columns (health status).
  • If column points (health status) are close, then these columns have similar conditional distributions across rows (age groups).
  • If a row point is close to a column point, then that configuration suggests a particular deviation from independence.

So, we can see that the 16-24 and 25-34 groups are very close, meaning they have a similar distribution across health status. This makes sense, since both are groups of young people and the effects of age start to manifest later in life. The rest of the age groups lie far from each other, so they behave differently.
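The closeness of the two youngest row points can be checked numerically by comparing their conditional row profiles. For simplicity this sketch uses a plain L1 distance between profiles (correspondence analysis itself uses the chi-square metric, so this is illustration only):

```python
# Row counts for three age groups from the contingency table
# (columns: VG, G, R, B, VB)
table = {
    "16-24": [243, 789, 167, 18, 6],
    "25-34": [220, 809, 164, 35, 6],
    "75+":   [20, 136, 157, 66, 17],
}
# Conditional row profiles: each group's distribution over health status
profiles = {age: [x / sum(counts) for x in counts]
            for age, counts in table.items()}

def profile_dist(a, b):
    """L1 distance between two row profiles (simpler than the chi-square
    distance CA actually uses, but enough to compare similarity)."""
    return sum(abs(x - y) for x, y in zip(profiles[a], profiles[b]))

# The two young groups have nearly identical profiles; 75+ differs a lot.
print(profile_dist("16-24", "25-34"), profile_dist("16-24", "75+"))
```

This mirrors the CA map: rows with similar profiles are plotted close together.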

On the other hand, all the health statuses are very far from each other, which means that their distributions across age groups differ.

Finally, we can see that the 35-44 age group and the health status “Good” are very close to each other, which suggests a special association between them. The same could be said of the “R” (fair) status and the 55-64 age group.